Abstract

For regularized estimation, the upper tail behavior of the random Lipschitz coefficient associated with empirical loss functions is known to play an important role in the error bound of Lasso for high dimensional generalized linear models. The upper tail behavior is known for linear models but much less so for nonlinear models. We establish exponential type inequalities for the upper tail of the coefficient and illustrate an application of the results to Lasso likelihood estimation for high dimensional generalized linear models.

1 Introduction

Let $(Y_1, Z_1), \ldots, (Y_N, Z_N)$ be independent random variables taking values in a product measurable space $\mathcal{Y} \times \mathcal{Z}$, with $Y_i$ being regarded as response variables and $Z_i$ as covariates. In order to cover both random designs and fixed designs, $(Y_i, Z_i)$ are not necessarily identically distributed. A large class of Lasso type estimators for high dimensional generalized linear models can be formulated as

$$\hat\theta = \arg\min_{v \in D_0} \Big\{ \sum_{i \le N} \big[\gamma_i(h(Z_i)^T v, Y_i) + b(v)\big] + \sum_{j \le p} \lambda_j |v_j| \Big\}, \tag{1}$$

where $D_0 \ne \emptyset$ is a domain in $\mathbb{R}^p$, $\gamma_i(t, y)$ are a given set of real valued functions on $\mathbb{R} \times \mathcal{Y}$, oftentimes identical to each other, $h = (h_1, \ldots, h_p) : \mathcal{Z} \to \mathbb{R}^p$ and $b : D_0 \to \mathbb{R}$ are given functions, and $\lambda_1, \ldots, \lambda_p > 0$ are coefficients of the weighted $\ell_1$ penalty on $v$. In this article, we only consider nonadaptive Lasso, in which $\lambda_1, \ldots, \lambda_p$ are fixed beforehand.

Under the setting of (1), for each $v \in D_0$, we have $N$ loss functions, each defined as $(y, z) \to \gamma_i(h(z)^T v, y) + b(v)$. The corresponding empirical losses are $\gamma_i(h(Z_i)^T v, Y_i) + b(v)$, and the corresponding expected total loss is

$$L(v) = \sum_{i \le N} E\big[\gamma_i(h(Z_i)^T v, Y_i) + b(v)\big], \quad v \in D_0. \tag{2}$$

As the title suggests, the main interest of the article is the so called "local stochastic Lipschitz" (LSL) condition. By LSL we mean the following. For the time being, denote by

$$\tilde L(v) = \sum_{i \le N} \big[\gamma_i(h(Z_i)^T v, Y_i) + b(v)\big] - L(v)$$

the fluctuation of the empirical total loss from its expectation at parameter value $v$. Let $\theta \in \mathbb{R}^p$ be fixed. Under smoothness conditions for $\gamma_i$, it is easy to see that $\tilde L(v)$ is differentiable with probability (w.p.) 1, which in general leads to Lipschitz continuity of $\tilde L(v)$ provided $D_0$ is compact. The LSL condition, on the other hand, refers to a bound on the upper tail probability of the random variable

$$\sup_{v \in D_0,\, v \ne \theta} \frac{|\tilde L(v) - \tilde L(\theta)|}{\sum_{j \le p} \lambda_j |v_j - \theta_j|}. \tag{3}$$

Note that the LSL condition is with respect to a weighted $\ell_1$ norm of $\mathbb{R}^p$. The condition is called "local" because $\theta$ is fixed, even though its value is typically unknown.

Although it might not be apparent at this point, the LSL condition is closely related to the issue of estimation error for Lasso. For linear regression with square loss function $(y - h(z)^T v)^2$, this relationship is well known and has been regularly employed to obtain estimation error bounds
[U |31 [HI]- Indeed, in this case, due to linearity, the LSL condition is rather easy to establish. However, 
for other loss functions, the LSL condition is much less clear and, to the best of my knowledge, has not been fully explored. An alternative to the LSL condition is a convexity assumption, in which $\gamma_i(t, y)$ is convex in $t$ and $b(v)$ is convex in $v$. The convexity assumption allows a linear interpolation technique to be employed to yield upper bounds for estimation error [12]. While the convexity assumption allows for nondifferentiable $\gamma_i$, it is not clear how the technique can be extended to nonconvex loss functions.

We shall establish the LSL condition for general loss functions. For differentiability, we only require that $\gamma_i(t, y)$ be first order differentiable in $t$ with the partial derivative being Lipschitz. After getting various results on the LSL condition, we will then illustrate an application of the LSL condition to Lasso type nonlinear regression, by finding an upper bound for the $\ell_2$ norm of estimation error.

Previously, in [6], the LSL condition was studied for loss functions of the form $(y - g_i(h(z)^T v))^2$, $i \le N$, where $g_i : \mathbb{R} \to \mathbb{R}$ are nonlinear. The condition was established under the assumptions that $g_i$ are twice continuously differentiable and

$$Y_i = g_i(h(Z_i)^T \theta) + \varepsilon_i, \tag{4}$$

where $\varepsilon_i$ are uniformly bounded zero mean noise. In this article, we extend the result in two respects. First, the LSL condition is established for general $\gamma_i(t, y)$, while still under the assumption of uniform boundedness. Second, it is established for (4) when $\varepsilon_i$ are Gaussian. Whereas the bounds for general $\gamma_i(t, y)$ are of Bernstein type, the bounds for the Gaussian case are of Hoeffding type. In [6], a truncation argument was suggested for the Gaussian case. However, the LSL condition obtained in this way is not as tight as the one to be obtained here. The tools used to get the results on the LSL condition
are various measure concentration and comparison inequalities in Probability El [7] ■ 

Section 2 presents several results on the LSL condition. The discussion in the section is actually more general. It provides upper bounds on the tail probability of the remainder of the Taylor expansion of $\tilde L(v)$. The LSL condition is a simple consequence of these bounds.

In Section 3, we consider an application of the LSL condition to Lasso. Besides the LSL condition, Lasso involves another issue, that is, the amount of separation of $v$ and $\theta$ based on the difference between $\gamma_i(h(Z_i)^T v, Y_i)$ and $\gamma_i(h(Z_i)^T \theta, Y_i)$. This issue is of a different nature from the LSL condition, and its resolution in general requires further conditions on the matrix $[h_j(Z_i)]_{i \le N, j \le p}$. The issue has
been studied in quite a few works [14[ O IHl H H3J . For transparency, we will use a restricted eigenvalue 
condition in [1] for our purpose. We will consider an example of Lasso type MLE for a high dimensional generalized linear model and apply the LSL condition to bound the $\ell_2$ norm of the estimation error.
Unfortunately, the method of the example gives no clue on model selection or more elaborate bounds 
similar to those obtained for linear models under square loss [Ml [TJ 0] . All the proofs are presented 
in Section 4.
1.1 Notation

For $q \in [1, \infty)$, denote by $\|a\|_q$ the $\ell_q$ norm of $a \in \mathbb{R}^d$. For two vectors $a = (a_1, \ldots, a_m)^T$ and $b = (b_1, \ldots, b_n)^T$, recall that their tensor product is

$$a \otimes b = (a_1 b^T, \ldots, a_m b^T)^T = (a_1 b_1, \ldots, a_1 b_n, \ldots, a_m b_1, \ldots, a_m b_n)^T \in \mathbb{R}^{mn}.$$

Denote by $v^{\otimes k}$ the tensor product of $k$ copies of $v$.

If $f$ is a function on a domain $\Omega \subset \mathbb{R}^d$, then it is Lipschitz (under the Euclidean norm) if

$$\|f\|_{\mathrm{Lip}} := \sup_{x \ne y \in \Omega} \frac{|f(x) - f(y)|}{\|x - y\|_2} < \infty.$$

Finally, for any random vector $X$, denote its deviation from mean by

$$[X] = X - EX.$$

By linearity of expectation, $[X + Y] = [X] + [Y]$. By this notation,

$$\tilde L(v) = \sum_{i \le N} [\gamma_i(h(Z_i)^T v, Y_i)].$$

The right hand side is independent of $b(v)$ and at the same time better reveals the other quantities involved. We will discard the notation $\tilde L$ in favor of $[\cdot]$ for the rest of the article.



1.2 Notes 

The methods in Section [2] can be used with little change to deal with the following additive mixture 
of loss functions, 

i<N k<q 

where for each $k \le q$ and $i \le N$, $h_k = (h_{k1}, \ldots, h_{kp})$ is a function from $\mathcal{Z}$ to $\mathbb{R}^p$, and $\gamma_{ik}$ is a loss function. For example

i<N i<N 

is a special case of additive mixture, where $Z_i$ and $\tilde Z_i$ are covariates that may be identical or have completely different sets of coordinates. Due to identifiability issues in the context of parameter estimation, such mixtures will not be further considered in the article.



2 Local stochastic Lipschitz condition 

In this section, we present exponential bounds on the tail probability of the random local Lipschitz coefficient (3). As noted earlier, these bounds are consequences of more general results on the tail probability of remainders of Taylor expansion of random functions. Therefore, most of the discussion below will be on the latter and the results on the LSL condition will be given as corollaries.

2.1 General loss function 

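Suppose $\gamma_1, \ldots, \gamma_N$ satisfy the following regularity condition.

Assumption 1 (Regularity). There are $m \in \{0, 1, 2, \ldots\}$ and $-\infty \le a_i < b_i \le \infty$, $i \le N$, such that w.p. 1, each $\gamma_i(t, Y_i)$ as a function of $t$ is $m$ times differentiable on $(a_i, b_i)$ with the $m$-th derivative being bounded and Lipschitz. Let $F_m$, $F_{m+1}$ be constants such that w.p. 1,

$$\Big|\frac{\partial^m \gamma_i(t, Y_i)}{\partial t^m}\Big| \le F_m, \qquad \Big|\frac{\partial^m \gamma_i(t, Y_i)}{\partial t^m} - \frac{\partial^m \gamma_i(t', Y_i)}{\partial t^m}\Big| \le F_{m+1} |t - t'|, \qquad \forall t, t' \in (a_i, b_i),\ i \le N.$$
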
Suppose $h$ satisfies the following condition.

Assumption 2 (Boundedness). There are constants $d_1, \ldots, d_p \in (0, \infty)$, such that

$$\Pr\Big\{\max_{i \le N} |h_j(Z_i)| \le d_j,\ \forall j \le p\Big\} = 1.$$

Next, let $D_0 \ne \emptyset$ be a domain in $\mathbb{R}^p$.

Assumption 3 (Parameter Domain). For $(a_i, b_i)$ as in Assumption 1 and $h$ as in Assumption 2,

$$\Pr\{h(Z_i)^T v \in (a_i, b_i),\ \forall v \in D_0,\ i \le N\} = 1.$$

From Assumption 1 and dominated convergence, differentiation and expectation can be exchanged for $\gamma_i(t, Y_i)$, i.e.,

$$E\Big[\frac{\partial^k \gamma_i(t, Y_i)}{\partial t^k}\Big] = \frac{d^k E[\gamma_i(t, Y_i)]}{dt^k}, \quad t \in (a_i, b_i),\ i \le N,\ k \le m.$$

By Assumption 2, $|h_j(Z_i)/d_j| \le 1$ w.p. 1. Therefore, $d_j$ can be thought of as the "scales" of the functions $h_j$.

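Theorem 2.1. Under Assumptions 1-3, fix an arbitrary $\theta \in D_0$. Then for $v \in D_0$,

$$\sum_{i \le N} [\gamma_i(h(Z_i)^T v, Y_i)] = \sum_{k \le m} \frac{1}{k!} \sum_{i \le N} \Big[\frac{\partial^k \gamma_i(h(Z_i)^T \theta, Y_i)}{\partial t^k} \big(h(Z_i)^T (v - \theta)\big)^k\Big] + \xi(v) \Big(\sum_{j \le p} d_j |v_j - \theta_j|\Big)^m \tag{5}$$

$$= \sum_{k \le m} \frac{1}{k!} \Big(\sum_{i \le N} \Big[\frac{\partial^k \gamma_i(h(Z_i)^T \theta, Y_i)}{\partial t^k} h(Z_i)^{\otimes k}\Big]\Big)^T (v - \theta)^{\otimes k} + \xi(v) \Big(\sum_{j \le p} d_j |v_j - \theta_j|\Big)^m, \tag{6}$$

where $\{\xi(v), v \in D_0\}$ is a process that has the following upper tail property

$$\Pr\Big\{\sup_{v \in D_0} |\xi(v)| > A\sqrt{2\ln(2p)} + B\sqrt{2\ln(p^m/q)} + C\ln(p^m/q)\Big\} \le q, \quad \forall q \in (0, 1),$$

with $A$, $B$, and $C$ being set as follows. First, let

$$R = \sup_{u, v \in D_0} \sum_{j \le p} d_j |u_j - v_j|, \qquad \phi = \min\Big(\frac{2F_m}{m!}, \frac{F_{m+1} R}{(m+1)!}\Big), \qquad \psi = \begin{cases} F_{m+1}/m! & m \ne 1, \\ F_{m+1}/2 & m = 1. \end{cases} \tag{7}$$

Then

$$A = 8\psi R\, E\sqrt{\max_{j \le p} \sum_{i \le N} [h_j(Z_i)/d_j]^2}, \qquad B = \phi \sqrt{E \max_{j \le p} \sum_{i \le N} [h_j(Z_i)/d_j]^{2m}}, \qquad C = 8\phi,$$

where in the definition of $B$ the convention $x^0 = 1$ is used for $m = 0$.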
Note that if $F_{m+1} > 0$, then the above result is meaningful only when $R < \infty$, that is, $D_0$ is bounded. On the other hand, if w.p. 1, for $i \le N$, $\gamma_i(t, Y_i)$ is a linear function of $t$, then one can set $F_{m+1} = 0$. By Theorem 2.1, this yields $A = B = C = 0$, which implies $\xi(v) = 0$. Of course, the last fact is easily seen from the linearity of $\gamma_i(t, Y_i)$.

Of particular interest is the case where $m = 1$. From Theorem 2.1, the following result obtains.



Corollary 2.2. Under Assumptions 1-3 with $m = 1$, fix an arbitrary $\theta \in D_0$. Then for $v \in D_0$,



3<P 



(8) 



i<N i<N 

where $\xi(v)$ is as in Theorem 2.1 and $\zeta_1$ is a random variable with the following upper tail property

$$\Pr\big\{|\zeta_1| > F_1 \sqrt{2N\ln(2p/q)}\big\} \le q, \quad \forall q \in (0, 1).$$



Since 

16 1 



sup > sup = — 

veD veD o ,v^0 l^j<p A J \ v i - V] I 



£ l 7i (h(Z i ) T v,Y i )--y i (h(Z i ) T 9,Y i }} 



■<N 



from the result, we then get a desired form of the LSL condition. For any $q, q' \in (0, 1)$ not necessarily equal, one can find $M(q, q')$ such that w.p. at least $1 - q - q'$, the random local Lipschitz coefficient on the right hand side is no greater than $M(q, q')$. Moreover, one can set

$$M(q, q') = A\sqrt{2\ln(2p)} + B\sqrt{2\ln(p/q)} + C\ln(p/q) + F_1\sqrt{2N\ln(2p/q')},$$

with $A$, $B$ and $C$ given as in Theorem 2.1 with $m = 1$.



2.2 Gaussian case 

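Suppose $Z_1, \ldots, Z_N$ are fixed and

$$Y_i = \mu_i - \omega_i,$$

where $\mu_i$ are some unknown constants, and $\omega_1, \ldots, \omega_N$ are independent square-integrable random variables with mean 0. Let $f_1, \ldots, f_N : \mathbb{R} \to \mathbb{R}$ be a set of transforms specified beforehand, and $h = (h_1, \ldots, h_p) : \mathcal{Z} \to \mathbb{R}^p$ a measurable function. Suppose the goal is to use $f_i(h(Z_i)^T v)$ to approximate $\mu_i$ under the square loss functions

$$\gamma_i(t, Y_i) = (Y_i - f_i(t))^2 / 2. \tag{9}$$

For any $v$, provided that $h(Z_i)^T v$ is in the domain of $f_i$ for all $i \le N$,

$$[\gamma_i(h(Z_i)^T v, Y_i)] = \tfrac{1}{2}\big(\mu_i - \omega_i - f_i(h(Z_i)^T v)\big)^2 - \tfrac{1}{2} E\big[\big(\mu_i - \omega_i - f_i(h(Z_i)^T v)\big)^2\big]$$

$$= \omega_i \big[f_i(h(Z_i)^T v) - \mu_i\big] + \tfrac{1}{2}\big[\omega_i^2 - \mathrm{Var}(\omega_i)\big].$$

Thus, for any $\theta$, provided that $h(Z_i)^T \theta$ is in the domain of $f_i$ for all $i \le N$ as well,

$$\sum_{i \le N} [\gamma_i(h(Z_i)^T v, Y_i)] - \sum_{i \le N} [\gamma_i(h(Z_i)^T \theta, Y_i)] = \sum_{i \le N} \omega_i \big[f_i(h(Z_i)^T v) - f_i(h(Z_i)^T \theta)\big].$$

As a result, we will focus on the expansion of the random function

$$v \to \sum_{i \le N} \omega_i f_i(h(Z_i)^T v)$$

around any fixed $\theta \in D_0$.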
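The cancellation of the terms that do not depend on $v$ can be checked numerically. The following is a minimal sketch, assuming noise with known variances so that the expectation inside the centering is available in closed form; the function name `centered_square_loss` and all numeric values are hypothetical choices made only for illustration.

```python
import numpy as np

def centered_square_loss(f_val, mu, omega, var_omega):
    """[gamma_i] = gamma_i - E[gamma_i] for gamma_i = (Y_i - f)^2 / 2, Y_i = mu - omega."""
    loss = 0.5 * (mu - omega - f_val) ** 2
    # E[(mu - omega - f)^2] = (mu - f)^2 + Var(omega), because omega has mean 0.
    expected = 0.5 * ((mu - f_val) ** 2 + var_omega)
    return loss - expected

rng = np.random.default_rng(0)
N = 5
mu = rng.normal(size=N)                 # unknown constants mu_i (illustrative values)
sigma = rng.uniform(0.5, 1.5, size=N)   # standard deviations of omega_i
omega = sigma * rng.normal(size=N)      # one realisation of the noise
f_v = rng.normal(size=N)                # stands in for f_i(h(Z_i)^T v)
f_theta = rng.normal(size=N)            # stands in for f_i(h(Z_i)^T theta)

# Difference of the centered losses at v and at theta ...
lhs = np.sum(centered_square_loss(f_v, mu, omega, sigma**2)
             - centered_square_loss(f_theta, mu, omega, sigma**2))
# ... equals the noise-weighted difference of the transforms.
rhs = np.sum(omega * (f_v - f_theta))
```

The two quantities agree up to floating point error for any realisation of the noise, which is why only the random function $v \to \sum_i \omega_i f_i(h(Z_i)^T v)$ needs to be expanded.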
Assumption 4 (Regularity). There are $m \in \{0, 1, 2, \ldots\}$ and $-\infty \le a_i < b_i \le \infty$, $i \le N$, such that each $f_i$ is $m$ times differentiable on $(a_i, b_i)$ with the $m$-th derivative being bounded and Lipschitz. Let

$$F_m = \max_{i \le N} \sup_{t \in (a_i, b_i)} |f_i^{(m)}(t)|, \qquad F_{m+1} = \max_{i \le N} \|f_i^{(m)}\|_{\mathrm{Lip}}.$$

Since $Z_i$ are fixed, Assumption 2 is no longer needed. Instead, simply define

$$d_j = \max_{i \le N} |h_j(Z_i)|.$$

Also, modify Assumption 3 as follows.

Assumption 5 (Parameter Domain). The domain $D_0 \ne \emptyset$ of candidate parameter values satisfies
$$h(Z_i)^T v \in (a_i, b_i), \quad \forall v \in D_0,\ i \le N.$$

In [5], the case where $\omega_i$ are uniformly bounded is considered. Here we shall deal with the following situation.

Assumption 6 (Gaussian). $\omega_1, \ldots, \omega_N$ are independent Gaussian variables with $\mathrm{Var}(\omega_i) \le \sigma_0^2$, $i \le N$, where $\sigma_0 \in (0, \infty)$ is a constant.

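Theorem 2.3. Let the loss functions $\gamma_1, \ldots, \gamma_N$ be as in (9). Under Assumptions 4-6, fix an arbitrary $\theta \in D_0$. Then for $v \in D_0$,

$$\sum_{i \le N} \omega_i f_i(h(Z_i)^T v) = \sum_{k \le m} \frac{1}{k!} \Big(\sum_{i \le N} \omega_i f_i^{(k)}(h(Z_i)^T \theta) \big(h(Z_i)^T (v - \theta)\big)^k\Big) + \xi(v) \Big(\sum_{j \le p} d_j |v_j - \theta_j|\Big)^m \tag{10}$$

$$= \sum_{k \le m} \frac{1}{k!} \Big(\sum_{i \le N} \omega_i f_i^{(k)}(h(Z_i)^T \theta) h(Z_i)^{\otimes k}\Big)^T (v - \theta)^{\otimes k} + \xi(v) \Big(\sum_{j \le p} d_j |v_j - \theta_j|\Big)^m, \tag{11}$$

where $\{\xi(v), v \in D_0\}$ is a process that has the following upper tail property

$$\Pr\Big\{\sup_{v \in D_0} |\xi(v)| > \sigma_0 \big(A\sqrt{\ln(2p)} + B\sqrt{2\ln(p^m/q)}\big)\Big\} \le q, \quad \forall q \in (0, 1),$$

with $A$ and $B$ being set as follows. First, set $R$, $\phi$ and $\psi$ as in (7). Then

$$A = 8\psi R \sqrt{\max_{j \le p} \sum_{i \le N} [h_j(Z_i)/d_j]^2}, \qquad B = \phi \sqrt{\max_{j \le p} \sum_{i \le N} [h_j(Z_i)/d_j]^{2m}},$$

where in the definition of $B$ the convention $x^0 = 1$ is used for $m = 0$.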



Compared to Theorem 2.1, the above upper tail bound does not have a term of the form $C\ln(p^m/q)$. This is because in the Gaussian case, we can get a Hoeffding type inequality for the upper tail instead of a Bernstein type inequality.

From Theorem 2.3, the following result for the case $m = 1$ obtains. Note that the result is not entirely the same as Corollary 2.2.



Corollary 2.4. Under Assumptions 4-6 with $m = 1$, fix an arbitrary $\theta \in D_0$. Define positive constants $w_1, \ldots, w_p$ as

$$w_j^2 = \sigma_0^{-2} \sum_{i \le N} \mathrm{Var}(\omega_i) h_j(Z_i)^2. \tag{12}$$

Then for $v \in D_0$,

Uift(h(Zi) T v) = u Jl f l {h{Z l ) T 6) + aoF^i w i\ v 3 }2 d o\ v 3 - d A ( 13 ) 

i<N i<N j<p j<p 

where $\{\xi(v) : v \in D_0\}$ is as in Theorem 2.3 and $\zeta_1$ is a random variable with the following upper tail property

$$\Pr\big\{|\zeta_1| > \sqrt{2\ln(p/q)}\big\} \le q, \quad \forall q \in (0, 1).$$

Similar to Corollary 2.2, the above result can be used to get the LSL condition. For example, for any $q, q' \in (0, 1)$, one can set

$$M(q, q') = \sigma_0 \Big[A\sqrt{\ln(2p)} + B\sqrt{2\ln(p/q)} + F_1\sqrt{2\ln(p/q')}\Big]$$

with $A$ and $B$ given as in Theorem 2.3 with $m = 1$, such that w.p. at least $1 - q - q'$, the following random local Lipschitz coefficient

$$\sup_{v \in D_0,\, v \ne \theta} \frac{1}{\sum_{j \le p} \lambda_j |v_j - \theta_j|} \Big|\sum_{i \le N} \omega_i \big[f_i(h(Z_i)^T v) - f_i(h(Z_i)^T \theta)\big]\Big|$$

is no greater than $M(q, q')$, where $\lambda_j = \max(w_j, d_j)$.



3 An application to high dimensional Lasso 

Under Assumptions 1-3, we consider the case where $Z_1, \ldots, Z_N$ are fixed. For simplicity, assume $d_1 = \cdots = d_p = d$ in Assumption 2. Consider the following Lasso functional

$$\hat\theta = \arg\min_{v \in D_0} \Big\{\sum_{i \le N} \gamma_i(h(Z_i)^T v, Y_i) + \lambda d \|v\|_1\Big\}, \tag{14}$$

where $\lambda > 0$ is the tuning parameter. Suppose $D_0$ is compact so that the minimum is always attained. The goal is to have $\hat\theta$ approximate $\theta$, where

$$\theta = \arg\min_{v \in D_0} \sum_{i \le N} E[\gamma_i(h(Z_i)^T v, Y_i)].$$

We next consider applying Corollary 2.2 to bound $\|\hat\theta - \theta\|_2$. Denote $X_i = h(Z_i)$ and $X$ the $N \times p$ matrix with $X_i^T$ as the $i$-th row vector. The total expected loss function now can be written as

$$L(v) = \sum_{i \le N} E[\gamma_i(X_i^T v, Y_i)], \quad v \in D_0.$$
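For intuition about the Lasso functional (14), the following minimal sketch minimizes it numerically in one special case. It is not the method of this article: it assumes the logistic loss $\gamma_i(t, y) = \ln(1 + e^t) - yt$ with $y \in \{0, 1\}$, uses proximal gradient descent (ISTA) as a swapped-in solver, and ignores the constraint $v \in D_0$. The names `soft_threshold` and `lasso_logistic` are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal map of tau * ||.||_1: shrink every coordinate toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def lasso_logistic(X, y, lam, d=1.0, n_iter=500):
    """ISTA sketch for min_v sum_i [log(1 + exp(x_i^T v)) - y_i x_i^T v] + lam * d * ||v||_1.

    Assumption: logistic loss and an unconstrained domain (the article requires
    v in a compact D_0, which this sketch does not enforce).
    """
    N, p = X.shape
    # The gradient of the smooth part is Lipschitz with constant ||X||_2^2 / 4.
    step = 4.0 / (np.linalg.norm(X, 2) ** 2)
    v = np.zeros(p)
    for _ in range(n_iter):
        t = X @ v
        grad = X.T @ (1.0 / (1.0 + np.exp(-t)) - y)   # gradient of the summed loss
        v = soft_threshold(v - step * grad, step * lam * d)
    return v
```

With a sufficiently large tuning parameter the penalty dominates and the iterates never leave the origin, while smaller values of `lam` give sparse nonzero solutions.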

Denote by $\mathrm{spt}(v) = \{j \le p : v_j \ne 0\}$ and by $\|v\|_0$ the cardinality of the set. In general, in order to bound $\|\hat\theta - \theta\|_2$, some conditions on $X$ are needed in order to get a bound in terms of the $\ell_1$ norm of
v — 6 (cf . p~3j IH Q] ) . For transparency, we use a "restricted eigenvalue" condition formulated in [1] , 
which says that for some $1 \le s \le p$ and $K > 0$,

$$\kappa(s, K) := \min\Big\{\frac{\|Xv\|_2}{\sqrt{N}\,\|v_J\|_2} : 1 \le |J| \le s,\ v \ne 0,\ \|v_{J^c}\|_1 \le K \|v_J\|_1\Big\} > 0.$$

To see where the LSL condition is to be used, we first summarize an argument that has been more 
or less used for special cases of Lasso (cf. [HI H]). Note that the argument does not lead to model 
selection or more elaborate bounds that have been obtained especially for linear models under square 
loss [5JEE31EEJI4]. 
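Theorem 3.1. Suppose the following conditions are satisfied.

1) For some $K > 1$,

$$\kappa := \kappa(2\|\theta\|_0, K) > 0. \tag{15}$$

2) For some $C_\gamma > 0$,

$$L(v) - L(\theta) \ge C_\gamma \|X(v - \theta)\|_2^2, \quad \forall v \in D_0. \tag{16}$$

3) Given $q \in (0, 1)$, suppose there is $M_q > 0$, such that w.p. at least $1 - q$,

$$\Big|\sum_{i \le N} \big[\gamma_i(X_i^T \hat\theta, Y_i) - \gamma_i(X_i^T \theta, Y_i)\big]\Big| \le M_q d \|\hat\theta - \theta\|_1. \tag{17}$$

Then, by setting

$$\lambda = \frac{(K + 1) M_q d}{K - 1} \tag{18}$$

in the Lasso functional (14), on the event that (17) holds,

$$\|\hat\theta - \theta\|_2 \le \frac{M_q \sqrt{\|\theta\|_0}}{N} \times \frac{2\sqrt{2 + K^2}\, K d}{C_\gamma \kappa^2 (K - 1)}. \tag{19}$$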

Theorem 3.1 has three conditions. The first one is the aforementioned restricted eigenvalue condition. In some cases, the second condition is easy to establish. The third condition is the LSL condition. By Corollaries 2.2 and 2.4, $M_q$ can be set reasonably small, ideally of order $\sqrt{N}$ or even smaller.
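To see how the quantities in Theorem 3.1 interact, the right hand side of the $\ell_2$ bound (19) can be evaluated directly. The sketch below assumes the bound reads $\frac{M_q\sqrt{\|\theta\|_0}}{N} \cdot \frac{2\sqrt{2+K^2}\,Kd}{C_\gamma\kappa^2(K-1)}$, which is the form obtained at the end of the proof of the theorem; the function name `lasso_l2_bound` and the argument names are hypothetical.

```python
import math

def lasso_l2_bound(M_q, sparsity, N, K, d, C_gamma, kappa):
    """Evaluate the right hand side of the l2 error bound of Theorem 3.1.

    M_q      : LSL constant from condition 3)
    sparsity : ||theta||_0, the number of nonzero coordinates of theta
    N        : sample size
    K        : constant of the restricted eigenvalue condition (must exceed 1)
    d        : common scale d_1 = ... = d_p
    C_gamma  : curvature constant from condition 2)
    kappa    : restricted eigenvalue constant from condition 1)
    """
    if K <= 1 or kappa <= 0 or C_gamma <= 0:
        raise ValueError("need K > 1, kappa > 0 and C_gamma > 0")
    scale = M_q * math.sqrt(sparsity) / N
    factor = 2.0 * math.sqrt(2.0 + K**2) * K * d / (C_gamma * kappa**2 * (K - 1.0))
    return scale * factor
```

When `M_q` grows like the square root of `N`, the returned value decays like the square root of `sparsity / N`, which matches the order discussed for the example in this section.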

Example 3.1. Let $\mathcal{Y}$ be a Euclidean space and $\mathcal{F} = \{f(y \mid t) : y \in \mathcal{Y}, t \in [a, b]\}$ a family of densities on $\mathcal{Y}$, where $-\infty < a < b < \infty$. Suppose given $Z_i$, the density of $Y_i$ is

$$f(y \mid X_i^T \theta),$$

where $\theta$ is the parameter and $X_i$ again is $h(Z_i)$. Suppose it is known that $\theta \in D_0$, where $D_0 \subset \mathbb{R}^p$ is an open bounded region such that for $v \in D_0$, $X_i^T v \in [a, b]$ for each $i \le N$. Then any solution $\hat\theta$ to (14) with

$$\gamma_i(t, y) = -\ln f(y \mid t) := \ell(t, y), \quad i \le N, \tag{20}$$

is an $\ell_1$ regularized MLE of $\theta$. Suppose $X$ satisfies (15). We next find some conditions in order for (16) to hold. Let $I(t)$ denote the Fisher information of $\mathcal{F}$ at $t$ and

$$D(t, s) = \int f(y \mid t) \ln \frac{f(y \mid t)}{f(y \mid s)}\, dy = E[\ell(s, Y)] - E[\ell(t, Y)], \quad Y \sim f(y \mid t),$$

the Kullback-Leibler distance from $f(y \mid s)$ to $f(y \mid t)$. For $\mathcal{F}$ with enough regularity, it is not hard to show $D$ has the following properties:

1) $D$, $\partial D/\partial s$, $\partial^2 D/\partial s^2$ are continuous in $(t, s)$;

2) $D(t, t) = (\partial D/\partial s)(t, t) = 0$, $I(t) = (\partial^2 D/\partial s^2)(t, t) > 0$;

3) every $t \in [a, b]$ is identifiable in $\mathcal{F}$; and

4) as $h \to 0$, $D(t, t + h)/h^2 \to I(t)$ uniformly for $t \in [a, b]$.

Property 2) implies that for $s$ in a neighborhood of $t$, $D(t, s) \ge I(t)(t - s)^2/2$. Together with the other three properties and the compactness of $[a, b] \times [a, b]$, for some $C_{\mathcal{F}} > 0$, $D(t, s) \ge C_{\mathcal{F}} (t - s)^2$ for all $t, s$. Now for $i \le N$ and $v \in D_0$, since $Y_i$ has density $f(y \mid X_i^T \theta)$,

$$E[\gamma_i(X_i^T v, Y_i)] - E[\gamma_i(X_i^T \theta, Y_i)] = D(X_i^T \theta, X_i^T v) \ge C_{\mathcal{F}} |X_i^T \theta - X_i^T v|^2.$$

Then by the definition of $L(v)$,

$$L(v) - L(\theta) \ge C_{\mathcal{F}} \sum_{i \le N} |X_i^T (v - \theta)|^2 = C_{\mathcal{F}} \|X(v - \theta)\|_2^2,$$

so (16) is satisfied.

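Finally, if $\gamma_i$ defined in (20) satisfies Assumptions 1-3, then by Corollary 2.2 and Theorem 3.1, given $q_1, q_2 \in (0, 1)$ with $q_1 + q_2 < 1$, the following bound

$$\|\hat\theta - \theta\|_2 \le \frac{(M_1 + M_2)\sqrt{\|\theta\|_0}}{N} \times \frac{2\sqrt{2 + K^2}\, K d}{C_{\mathcal{F}} \kappa^2 (K - 1)}$$

holds with probability at least $1 - q_1 - q_2$, where $M_1$ and $M_2$ are as follows. Denote by $V_1, \ldots, V_p$ the column vectors of $X$ and

$$\Delta = \sup_{u, v \in D_0} \|u - v\|_1.$$

Denote

$$F_1 = \operatorname{ess\,sup} \sup_t \max_{i \le N} \Big|\frac{\partial \ell(t, Y_i)}{\partial t}\Big|, \qquad F_2 = \operatorname{ess\,sup} \max_{i \le N} \Big\|\frac{\partial \ell(\cdot, Y_i)}{\partial t}\Big\|_{\mathrm{Lip}}.$$

Note $d_1 = \cdots = d_p = d$. Then

$$M_1 = A\sqrt{2\ln(2p)} + B\sqrt{2\ln(p/q_1)} + 8\phi\ln(p/q_1), \qquad M_2 = F_1\sqrt{2N\ln(2p/q_2)},$$

where

$$A = 4F_2 \Delta \max_{j \le p} \|V_j\|_2, \qquad B = (F_2/2) \Delta \max_{j \le p} \|V_j\|_2, \qquad \phi = \min(2F_1, F_2 d\Delta/2).$$

Up to a factor of $\sqrt{\ln(p/q_2)}$, $M_2 = O(\sqrt{N})$. Typically, for well designed $X$, $\max_{j \le p} \|V_j\|_2 = O(\sqrt{N})$. Therefore, $M_1 = O(\sqrt{N})$ up to a multiplicative factor $\sqrt{\ln(p/q_1)}$ and an additive remainder of order $\ln(p/q_1)$. As a result, $\|\hat\theta - \theta\|_2$ is of order $\sqrt{\|\theta\|_0/N}$ up to factors much smaller than $\sqrt{N}$ unless $p$ is extremely large.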
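The constants $M_1$ and $M_2$ of the example are explicit functions of the design matrix, so their size can be inspected numerically. The following is a minimal sketch under the formulas $A = 4F_2\Delta\max_j\|V_j\|_2$, $B = (F_2/2)\Delta\max_j\|V_j\|_2$ and $\phi = \min(2F_1, F_2 d\Delta/2)$ as read from the example; the function name `lsl_constants` is hypothetical, and the values of $F_1$, $F_2$, $d$ and $\Delta$ must be supplied by the user for the model at hand.

```python
import numpy as np

def lsl_constants(X, F1, F2, d, Delta, q1, q2):
    """Return (M1, M2) for a fixed N x p design X, following the example's formulas.

    F1, F2 : bounds on the first derivative of the loss and on its Lipschitz norm
    d      : common scale of the covariate functions
    Delta  : l1 diameter of the parameter domain
    q1, q2 : failure probabilities in (0, 1)
    """
    N, p = X.shape
    max_col_norm = np.linalg.norm(X, axis=0).max()   # max_j ||V_j||_2
    phi = min(2.0 * F1, F2 * d * Delta / 2.0)
    A = 4.0 * F2 * Delta * max_col_norm
    B = (F2 / 2.0) * Delta * max_col_norm
    M1 = (A * np.sqrt(2.0 * np.log(2 * p))
          + B * np.sqrt(2.0 * np.log(p / q1))
          + 8.0 * phi * np.log(p / q1))
    M2 = F1 * np.sqrt(2.0 * N * np.log(2 * p / q2))
    return M1, M2
```

For a design with entries bounded by a constant, `max_col_norm` is at most of order $\sqrt{N}$, so both returned values grow like $\sqrt{N}$ up to logarithmic factors in $p$.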

Similar conclusions can be made if $f(y \mid X_i^T \theta)$ is the density of $N(X_i^T \theta, \sigma_0^2)$. In this case, we can use Corollary 2.4. For brevity, the detail is omitted. □



4 Proofs 

In this section we give proofs for the results in previous sections. First, recall that for $q \in [1, \infty)$,

$$\|a \otimes b\|_q = \|a\|_q \|b\|_q, \tag{21}$$

and for $a_1, a_2 \in \mathbb{R}^m$, $b_1, b_2 \in \mathbb{R}^n$, $(a_1^T a_2)(b_1^T b_2) = (a_1 \otimes b_1)^T (a_2 \otimes b_2)$, giving

$$(a_1^T a_2)^k = (a_1^{\otimes k})^T (a_2^{\otimes k}). \tag{22}$$
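The identities (21) and (22) are easy to confirm numerically, since the tensor product of two vectors as defined in the Notation subsection coincides with the Kronecker product. The sketch below uses `numpy.kron`; the helper `tensor_power` and the random test vectors are hypothetical choices for illustration.

```python
import numpy as np

def tensor_power(v, k):
    """Tensor product of k copies of v; for k = 0 this is the scalar 1 as a length-1 vector."""
    out = np.ones(1)
    for _ in range(k):
        out = np.kron(out, v)
    return out

rng = np.random.default_rng(0)
a1, a2 = rng.normal(size=3), rng.normal(size=3)
b1, b2 = rng.normal(size=4), rng.normal(size=4)

# (21): the l_q norm is multiplicative over tensor products (checked here for q = 3).
q = 3.0
norm_identity = np.isclose(np.linalg.norm(np.kron(a1, b1), q),
                           np.linalg.norm(a1, q) * np.linalg.norm(b1, q))

# Inner products factorize over tensor products.
inner_identity = np.isclose((a1 @ a2) * (b1 @ b2),
                            np.kron(a1, b1) @ np.kron(a2, b2))

# (22): the k-th power of an inner product is an inner product of tensor powers.
k = 3
power_identity = np.isclose((a1 @ a2) ** k,
                            tensor_power(a1, k) @ tensor_power(a2, k))
```

Identity (22) is what allows a scalar power such as $(h(Z_i)^T(v-\theta))^k$ to be rewritten as a linear form in $(v-\theta)^{\otimes k}$.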
4.1 Proofs for Section 2

Proof of Theorem 2.1. By (22), (5) and (6) are equivalent. For notational brevity, we shall avoid explicit use of $d_j$. For this reason, the domain $D_0$ is not the one to be directly worked on. Rather, we shall consider

$$D = \{(d_1 v_1, \ldots, d_p v_p)^T : v \in D_0\}. \tag{23}$$

In other words, $D$ is the image of $D_0$ under the 1-1 transform $T : v \to (d_1 v_1, \ldots, d_p v_p)^T$. We shall use the $\ell_1$ norm on $D$. Note that the norm induces a weighted $\ell_1$ norm on $D_0$ as

$$\|u - v\| = \|Tu - Tv\|_1 = \sum_{j \le p} d_j |u_j - v_j|,$$

which is the reason why $\sum_{j \le p} d_j |v_j - \theta_j|$ appears in the expansions (5) and (6). Moreover, $R$ in (7) can be expressed as the diameter of $D$ under $\ell_1$,

$$R = \sup_{u, v \in D} \|u - v\|_1. \tag{24}$$

Based on the same consideration as (23), denote for $i \le N$, $j \le p$,

$$X_{ij} = h_j(Z_i)/d_j, \qquad X_i = (X_{i1}, \ldots, X_{ip})^T, \qquad V_j = (X_{1j}, \ldots, X_{Nj})^T.$$

Then Assumption 2 on the boundedness of $h_j(Z_i)$ implies

$$\Pr\{|X_{ij}| \le 1,\ \forall i \le N,\ j \le p\} = 1. \tag{25}$$

Furthermore, for $v \in D$, let $u \in D_0$ such that $Tu = v$. Then $X_i^T v = h(Z_i)^T u$, so we can easily translate an expansion in terms of $X_i^T v$ into one in terms of $h(Z_i)^T u$. Therefore, until the end of the proof, we will focus on $D$.

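For brevity, for each $i \le N$, denote

$$f_i(t) = \gamma_i(t, Y_i), \qquad f_i^{(k)}(t) = \frac{\partial^k \gamma_i(t, Y_i)}{\partial t^k}, \quad k \le m.$$

Fix $\theta \in D$. For $i \le N$ and $v$, define random vectors $c = (c_1, \ldots, c_N)$ and $t = (t_1, \ldots, t_N)$ with

$$c_i = X_i^T \theta, \qquad t_i = X_i^T (v - \theta).$$

For $i \le N$, let $\varphi_i$ be the following random function on $\mathbb{R}$,

$$\varphi_i(t) = \begin{cases} t^{-m} \Big[f_i(c_i + t) - \sum_{k \le m} \dfrac{f_i^{(k)}(c_i) t^k}{k!}\Big] & t \ne 0, \\ 0 & t = 0. \end{cases} \tag{26}$$

We need the following property of $\varphi_i$.

Lemma 4.1. W.p. 1, each $\varphi_i \in C(a_i - c_i, b_i - c_i)$, and

$$|\varphi_i(t)| \le \min\Big(\frac{2F_m}{m!}, \frac{F_{m+1} |t|}{(m+1)!}\Big) \tag{27}$$

and $\|\varphi_i\|_{\mathrm{Lip}} \le \psi$, where

$$\psi = \begin{cases} F_{m+1}/m! & m \ne 1, \\ F_{m+1}/2 & m = 1. \end{cases}$$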
Lemma 4.1 will be proved later. Clearly,



E lAxjv, Yi) = e Mci + *o = E E 



i<N 



i<N 



i<N \ k<rn 



k\ 



E^fE/i%)*?)+Ew(^ 

k<m ' \i<N J i<N 



where, by Assumption [21 w.p. 1, U = Xj{v -6)& (oj - Ci, bi - Ci), Vi < N, v € D. Then by ([22]), 



E7^^) = E^(e/J%)*?*) ( 

i<7V fc<m ' \i<7V y 



\07H 



,i<N 



Therefore, 



E h(xjv,Y t )} = E^E l/i fc) ^)*rj ' (« - 

i<N k=l ' i<AT 



T 



J2l^(U)Xf m f(v-er m . (28) 



i<JV 



By Holder inequality and ([21)) , 



< 



E ymxf m i 



i<N 



\\v-e\\™ 



(29) 



For each j = (ji, • ■ ■ ,j p ) with j s < p, denote 



X-ij X%2\ ' ' ' Xij m , 



where the product on the right hand side is defined to be 1 if to — 0. Then the coordinates of Xf r< 
can be written as Xy, with j sorted, say, in the dictionary order. Let 



Zj = sup 



v£D 



E IVii^Xi: 



i<N 



Then from (j29l. 



i<N 



< \\v - 0\\? max. 



E 



<\\v-0\\?maxZ 3 . (30) 



By (gSJ, w.p. 1, |Xy| < 1, i < JV, j < p, and 



\Xj{v - 0)| < \\v - < i?. Then by 



Lemma 14. 1[ 



It follows that 



|<Pi(ti)| < min 



2-^m *m+l-^ 



to! (to + 1)! 



< 6 ^ 2< ^ : = A/ o, Vj, w.p. 1. 



(31) 



Observe that given v £ D, for each i < N, ip^t^Xij is a function only in (Y^, Zj). Therefore, by 
independence, for to > and v <E D, 



Var | E 



E Varfoft)**) < E E [w(*i) 2 *£] < 



E 

i<JV 
If $m = 0$, then the right hand side is $N\phi^2$. If $m \ge 1$, by Young inequality,



£^ = E*4"-*L< II 



2m 



i<JV 



i<N 



3<m \i<N 



Therefore, 



Var J2 ^(U)X VJ < Si := 



<t) 2 N 



II H^IIL< max 11^-11^. 

s<m 



TO = 



v i<JV 



^[max^pH^-lll-] to>1. 



(32) 



Fix one j — (ji, . . . , j p ). We next combine (|31l) and (|32p with measure concentration to bound the 
upper tail of Z 3 . Again, note that given v, ipi{ti)Xi 3 is a function only in (Yl, Zi), with ti — Xj (v — 9). 
Let 

T = {r = « a , ...,r v N a ):veD, a G {-1, 1}}, 
be a collection of functions parameterized by D x {—1,1} mapping (y x Z) w into K , such that 

rljY t , Zi) = aM^ 1 l^X^l , i< N. 
Then Z 3 = M Z, S 2 = M 2 S 2 , with 



Z = sup J2 ^(Y^Zi), S 2 = sup Var ^ r l (r,, Z, 



<N 



,i<N 



From (|3"Tj) . for w G D and a = ±1, a e [—1, 1]. Clearly, Er* a (li, ^) = 0. Furthermore, w.p. 1, 
a (Yi, Zi) is continuous in v. Therefore, by dominated convergence argument, Theorem 1.1 in [7] 
can be applied to Z. Let w = 2EZ + S 2 = 2EZj/M + S 2 /M?- Then by [7], 

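$$\Pr\{Z_{\mathbf{j}} > E Z_{\mathbf{j}} + M_0 a\} = \Pr\{Z > EZ + a\} \le \exp\Big\{-\frac{a^2}{2w + 3a}\Big\}, \quad \forall a > 0.$$

For $s > 0$, $a = (1/2)(3s + \sqrt{9s^2 + 8sw})$ is the unique positive solution to $a^2/(2w + 3a) = s$. Using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and $2\sqrt{ab} \le a + b$,

$$E Z_{\mathbf{j}} + M_0 a \le E Z_{\mathbf{j}} + (M_0/2)\big(3s + \sqrt{9s^2} + \sqrt{8sw}\big)$$

$$= E Z_{\mathbf{j}} + M_0 \Big(3s + \sqrt{2s(2E Z_{\mathbf{j}}/M_0 + S_0^2/M_0^2)}\Big)$$

$$\le E Z_{\mathbf{j}} + M_0 \Big(3s + \sqrt{4s E Z_{\mathbf{j}}/M_0} + \sqrt{2s S_0^2/M_0^2}\Big)$$

$$\le E Z_{\mathbf{j}} + M_0 \big(4s + E Z_{\mathbf{j}}/M_0 + (S_0/M_0)\sqrt{2s}\big).$$

Then

$$\Pr\big\{Z_{\mathbf{j}} > 2E Z_{\mathbf{j}} + S_0\sqrt{2s} + 4M_0 s\big\} \le e^{-s}. \tag{33}$$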
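The two algebraic facts used to pass from the Bernstein type exponent to (33) can be checked numerically: that $a = (3s + \sqrt{9s^2 + 8sw})/2$ solves $a^2/(2w + 3a) = s$, and that $EZ_{\mathbf{j}} + M_0 a$ is dominated by $2EZ_{\mathbf{j}} + S_0\sqrt{2s} + 4M_0 s$ when $w = 2EZ_{\mathbf{j}}/M_0 + S_0^2/M_0^2$. The sketch below does this on arbitrary positive inputs; the names `bernstein_radius` and `check_simplification` are hypothetical.

```python
import math

def bernstein_radius(s, w):
    """Positive root a of the quadratic a^2 - 3*s*a - 2*w*s = 0, i.e. a^2 / (2w + 3a) = s."""
    return 0.5 * (3.0 * s + math.sqrt(9.0 * s**2 + 8.0 * s * w))

def check_simplification(s, EZ, S0, M0):
    """Return (solves, dominated) for given s > 0, EZ >= 0, S0 >= 0, M0 > 0."""
    w = 2.0 * EZ / M0 + S0**2 / M0**2
    a = bernstein_radius(s, w)
    # a should invert the exponent a^2 / (2w + 3a).
    solves = math.isclose(a**2 / (2.0 * w + 3.0 * a), s)
    # The exact threshold should not exceed the simplified one appearing in (33).
    simplified = 2.0 * EZ + S0 * math.sqrt(2.0 * s) + 4.0 * M0 * s
    dominated = EZ + M0 * a <= simplified + 1e-12
    return solves, dominated
```

Replacing the exact root by the simplified threshold only enlarges the deviation level, so the probability bound $e^{-s}$ is preserved.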



To find an upper bound for EZ 3l let e%, . . . ,£/v be a Rademacher sequence independent of (Yi, Zi). 
By symmetrization inequality (cf. [9], Lemma 6.3) 



EZ, < 2E sup 

v£D 



£i<pi(ti)x i:i 



■<N 



(34) 
By Fubini Theorem, the expectation on the right hand side is



E Xj yE e sup 

vED 



y^ Ei<Pi{ti)Xi 



i<N 



where Ex,y denotes the expectation only with respect to the (marginal) distribution of {X\, Y\), . . . , 
(Xn,Yn), and similarly for E £ . 

From (|2"5|). <pi(0) = 0. Assume ip > first. Given (Xi,Yi), (X N ,Y N ), by Lemma ETT1 and 

f -> (pi(t)x i3 /i) 

is a contraction for each i < N. Meanwhile, we can write 



E e sup 

vED 



^2 £i^i{U)X t 



i<N 



E E sup 

teT 



^ e l ip i {U)X K 



\<N 



with T = T(Xi, . . . , X/v) = {(ti, . . . , tj\r) : i< = -^i ( u — 0)j w € -D}. Then by a comparison inequality 
(cf. Theorem 4.12 in [5]), 



E £ sup 

vED 



y^ £i<Pi(u)x^ 



: .<N 



< 2?/>E e sup 



i<JV 



Using = Xj (v — 9) and by the same argument for (|2l?|) 



E e sup 

teT 



E 6 sup 

ue-D 



i<JV 



< E £ sup 



i<JV 



Ik -f 111 < i?E e max|e T V, | , 



where e = (ei, . . . , Sn) t . With (Xi, Yi) being fixed, by a result in [ID] (Lemma 5.2), 



E e max|£ T ^| < v/21n(2p)max||^|| 2 . 



Combining the inequalities and taking expectation with respect to (Xi,Yi] 



E sup 

vED 



^ e i ip l {t i )X i 



■<N 



max 11^-112 

3<P 



(35) 



If ip = 0, then (/J, = and the above inequality holds trivially. Combining (1331) - (135]) yields 
Pr{z 3 > Mi ^2 ln(2p) + V2S y/s + 4M s\ < e~ s . 

where 



max \\Vj || 2 



(36) 



Mi = 8i/jRE 

Finally, since there are p m different values of j, by union-sum inequality, 

Pr | max Z 2 > Mi 0o(2p) + \/2S* y/]n(p m /q) + 4M ln(p m /g) 1 < ?) V ? € (0, 1). (37) 
Note Mi, Sq and 4Mq are exactly A, B and C in Theorem l2.ll Then by (f3T7|) . the proof is complete. □ 
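Proof of Corollary 2.2. As in the proof of Theorem 2.1, we consider the domain $D$ in (23) and $X_i$ defined before (25). Still denote $c_i = X_i^T \theta$. For $m = 1$, by Theorem 2.1, for $\theta, v \in D$,

$$\sum_{i \le N} [\gamma_i(X_i^T v, Y_i)] = \sum_{i \le N} [f_i(c_i)] + \Big(\sum_{i \le N} [f_i'(c_i) X_i]\Big)^T (v - \theta) + \xi(v) \|v - \theta\|_1.$$

By Hölder inequality,

$$\Big|\Big(\sum_{i \le N} [f_i'(c_i) X_i]\Big)^T (v - \theta)\Big| \le \|v - \theta\|_1 \max_{j \le p} \Big|\sum_{i \le N} [f_i'(c_i) X_{ij}]\Big|.$$

Given $j \le p$, $[f_i'(c_i) X_{ij}]$ are independent with mean 0, and each $|f_i'(c_i) X_{ij}| \le F_1$. Therefore, by Hoeffding inequality ([H], p. 191) and union-sum inequality,

$$\Pr\Big\{\max_{j \le p} \Big|\sum_{i \le N} [f_i'(c_i) X_{ij}]\Big| > t\Big\} \le 2p \exp\Big\{-\frac{t^2}{2N F_1^2}\Big\}.$$

Given $q \in (0, 1)$, let $t = \sqrt{N} F_1 \sqrt{2\ln(2p/q)}$ to get the right hand side no greater than $q$. Combining this with the bound for $\xi(v)$, the proof is complete. □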
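The choice of $t$ in the last step can be verified by plugging it back into the Hoeffding plus union bound. The following minimal sketch does so; the function names `hoeffding_union_threshold` and `union_tail_bound` and the numeric inputs are hypothetical.

```python
import math

def hoeffding_union_threshold(N, F1, p, q):
    """t = sqrt(N) * F1 * sqrt(2 ln(2p/q)), the deviation level chosen in the proof."""
    return math.sqrt(N) * F1 * math.sqrt(2.0 * math.log(2.0 * p / q))

def union_tail_bound(t, N, F1, p):
    """Hoeffding bound for one coordinate, multiplied by p through the union bound."""
    return 2.0 * p * math.exp(-t**2 / (2.0 * N * F1**2))

# With the chosen t, the bound on the tail probability equals q.
N, F1, p, q = 100, 2.0, 50, 0.05
achieved = union_tail_bound(hoeffding_union_threshold(N, F1, p, q), N, F1, p)
```

Because the threshold depends on $p$ only through $\sqrt{\ln(2p/q)}$, the linear term contributes a factor of order $\sqrt{N \ln p}$ to the Lipschitz coefficient.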

Proof of Theorem \2.S\ . The proof is similar to that of Theorem l2.11 so we will be brief. Define domain 
D as and Xy, Xi, Vj as in Let c = {c\, . . . , cat) and t = (ti, . . . , ijv) with 

Define ipi as in (|26|) . however, note that the meaning of /j is different here. In particular, fi are 
nonrandom and hence ifi are nonrandom as well. In spite of this, Lemma [4.1l still holds. Corresponding 
to (EH), 



Y J ^ifi{HZ i ) T v) 



i<N 



= E h { S <*fi k) w x ? k ) ( w - ^ fc + (e <«w(*o*f m l (« - *)® ro 

fc<m ' \i<iV / \i<N J 

The next step is to bound the upper tail probability of max., Zj, where for J = (ji, • ■ ■ ,j m ). 



Zj = sup 



veD 



E Ui<Pi(ti)Xi 



i<N 



LO = (uii, . . . ,0J N ) 



Write = aiSi, where of = Var(wi) < ctq an d £i, • ■ • , £n are i.i.d. ~ N(0, 1). Fix one J, Then 



Z 3 = Z(e) = sup 



i<JV 



, £ — (ex, . . . , £jv) t . 



The function Z is Lipschitz on R under the Euclidean norm (£2 norm), because for a, b € 



piV 



|Z(a) - Z(6)| < sup 



)(oi - bi)<Ti<pi(ti)Xij 



i<N 



< \\a - b\\ 2 (T Q S Q , 
where, as in (32),



4> 2 N 



m = 



l^rnax^H^lli™ m > 1. 
Now by a concentration inequality for Gaussian measure ([§], p. 41) 

Pr{Z(e) > EZ(e) + rcroS'o} < exp(-r 2 /2), Vr > 0. 



(38) 



By Lemma H. II and \Xij\ < 1, t —> ipi(t)Xij/tp is a contraction with being mapped to 0. Then 
by a comparison result for Gaussian process ([9], Corollary 3.17 and (3.13)) 



EZ(e) < Aa ipE sup 

v£D 



= 4(7o?/' E sup 



^e^ T ( w -0) 



i<N 



< 4croi?V'Emax|e T V 7 | , 

3<P 



and 



Emax £ Vj \ < 3\/lnpmax J Var(e T t/ ? ) = 3-\Z m P max II II 2- 

j<p j<p V ' j< p 



Using an argument in |10| , one can get a bound for the expectation that is tighter for large p. 
Lemma 4.2. There is 

Emax|e T V,| < 2 v /ln(2p)max||V,j| 2 . 

3<P 3<P 

Now (|38|) can be written in terms of Z 3 . Then, as in (|37|) . for g£ (0, 1), 

Pr I max Z } > a (m x y/\n(2p) + ^2 ln(p m /g)5 ) \ < q, 



(39) 



where 



Mi = 8Ri>m&x\\Vj\\2. 

3<P 



This then finishes the proof. 

Proof of Corollary \2.4\ From Theorem 12. 3[ it is seen that 



where 



□ 



J2 uifi(HZi) T v) = J2 ^HKz^e) + ( + fa) 4»h - 6 i 

3<p 



i<N 



i<N 



c = E E ^/;(m^) t ^)m^) ] (vj - 6 S ). 

j<p \i<N 



Therefore, with Wj being defined as in ([T2 



with 



ICI < ^o^i E w i \°i - 9 i \ x m ^ x \ w j\> 



w,- = — i— £ vifKKZiFefoiZi). 



i<N 



It is easy to see that each Wj is Gaussian with mean and variance no greater than 1. As a result, 

Pr \ max | Wo | > t \ < pexp(-< 2 /2), t > 0. 

I 3<P I 



Given g G (0, 1), letting t = y/2\n(p/q) then finishes the proof. 



□ 
4.2 Proof for Section 3

Proof of Theorem \3.1\ For A C {1, . . . ,p} and v £ W, denote by the vector u € with Uj = 
Vil{i £ A}. By definition of 6, 

L(9)-L{9) < £ [^(xTe.y,) - 7l (X 2 T 0,y)] +Ad(||0||i-||0||i). 

Let A = (1 + l/c)Mqd, where c > is to be determined. Then, writing r = 1/c, on the event that 
(fI7|) holds, 

- L(0) < Af g d [||0 - + (1 + r)(||0||i - ||0||i) 
Fix any J containing spt(0). Then 

||?-ff|| 1 + (l + r)(||fl||i-||?l|i) 

= Ei g '- e «i + Ei s *i + ( 1+r ) (X>i-E&i-Ei 
= E - e >\ + r )(i^i - i$d] - r E $1 

< (2 + r)||g*j-fl||i-r||flj.||i. 

On the one hand, the above inequalities yield 

L(0) - L{9) < M q d{2 + l/c)\\dj - 0\\i, (40) 

and on the other, since by definition of 9, L(9) > L(9), 

\\djA\i<{l + 2c)\\9 J -9\\ 1 . (41) 

Set c = (K-l)/2. Then A = (l + l/c)M q d is as in (jig). By jl5]), dTBJ) and gO), for any J D spt(6>) 
with |J| < 2||0|| o , 

NC 7 k 2 \\9j -6g< L(9) - L(9) < ^^\\9j - 9^. 

A — 1 

Since \\9j - 9\\ x < y/\T\\\6j - 9\\ 2 , it follows that 

fa-°h<WW with b=%x c J™_ iy (42) 

Let A be the set of indices % ^ spt(0) corresponding to the ||0||o largest \9i\. Then (1421) holds for 
both Jo = spt(0) and Ji = spt(#) U A. It is well known that (cf. [5]) 

IIA II 2 <r H^£g]ll 

By (|4"Tj) and Cauchy-Schwartz inequality followed by (l4"2"l) . 

11^113 < K ^ P ;!L nl < K 2 Wh, -8\\l< K 2 b 2 \\9\\ . 

Combining this with (|4"2"j) applied to J = J\ , 

ll« - - |fe - 0||1 + ||^/ f \\l < b 2 \Ji\ + K 2 b 2 \\9\\ = (2 + A- 2 )6 2 ||0|| o . 
So we finally arrive at (TT9)) . □ 
4.3 Proof of Lemmas

Proof of Lemma \jA\ If m = 0, then ifii(t) — fife + t) — fife). From Assumptions [T] and [3J the result 
is straightforward. 

Let to > 1. For t > 0, by Taylor expansion with an integral remainder, 

fife +*)-£ = T^yy /V - *) m - 1 [/ < (w) (c i + *) - /^(Ci)] d S) (43) 

k<rn JO 

yielding 

= t^—tt; f\t - sr-'iftH* +*)- f ( r ] fe)} da. 



[m- 1)1 Jo 

Therefore, by Assumption [l] on the one hand, 



\<Pi(t)\ < 



f-m ft op 

. / (2F m )(t- s r i d S 

Jo 



(m — 1)1 Jo to! 
and on the other, 

t ~ m f n „\m-ifT? „\j„ F m+1 \t\ 



\<Pi{t)\<- -ry / (t-sr-^F m+1 s)ds- 

(m - 1)! J (to + 1) 



The inequalities hold likewise for t < 0. Therefore, (|27l) holds. The above inequality also implies that 
ifi is continuous at 0. It is clear that ipi(t) is continuous at t ^= 0. Thus tp,; G C(dj — Cj, 6^ — Cj). 

It remains to show H^illup < V'- Since ifi is diffcrcntiable at t ^ 0, it is enough to show |^(£)| < ij) 
for t ^ 0. First, let m = 1. For t ^ 0, 

¥><(*) = t~ 2 [fife) - fife +t) + tfife + t)} = r 2 f[fife + t) - fife +t-s)] ds. 



By Assumption^ \f!fe + t) - fife +t - s)\ < F 2 \s\. Consequently |<^(f)| < F 2 /2 = ip. 
Finally, let to > 2. Define g(t) — mfife + t) — tf[fe + t). Then for k < to, 

0<*>(t) = (to - >(c, + 1) - tf^fe + t) 

and then 



m-1 , , s .(fc) . 



(m-k)f^'fe)t k 



= -t- 111 - 1 mhfe +t)- tfife + t)-J2 



fc! 

fc=0 



(m-2)\J 



(t-sr- 2 [g^- 1 \s)-g^- 1 H0)}ds, 



where the last equality is by similar Taylor expansion as (|43[) , now applied to g with order to — 1 . For 
each s, 

5 (™-D (s) _ 5 (™-i) (0 ) = f^-Qfr +s) - sf ^\ Ci + S ) _ /( m - x )( Ci ) 

= r[/i m) ( Cl + s - U )-/i m) ( Cl + S )]d U , 

Jo 
giving $|g^{(m-1)}(s) - g^{(m-1)}(0)| \le F_{m+1} s^2/2$. Then



f — ill — 1 J? ft z? 



2(m - 2)! Jo ml 
This finishes the proof. □ 
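Proof of Lemma 4.2. Let $x = E\max_{j \le p} |\varepsilon^T V_j|$. By Jensen inequality, for any $t > 0$,

$$\exp(tx) \le E\Big[\exp\Big(t \max_{j \le p} |\varepsilon^T V_j|\Big)\Big] = E\Big[\max_{j \le p} \exp(t |\varepsilon^T V_j|)\Big] \le \sum_{j \le p} E[\exp(t |\varepsilon^T V_j|)].$$

Since $\varepsilon^T V_j \sim N(0, \|V_j\|_2^2)$,

$$E[\exp(t |\varepsilon^T V_j|)] \le E[\exp(t \varepsilon^T V_j)] + E[\exp(-t \varepsilon^T V_j)] = 2\exp(t^2 \|V_j\|_2^2 / 2).$$

Then

$$\exp(tx) \le 2p \exp\Big(\frac{t^2}{2} \max_{j \le p} \|V_j\|_2^2\Big).$$

The proof is finished by letting $t = x/(2\max_{j \le p} \|V_j\|_2^2)$, which gives $3x^2/(8\max_{j \le p} \|V_j\|_2^2) \le \ln(2p)$ and hence $x \le 2\sqrt{\ln(2p)} \max_{j \le p} \|V_j\|_2$. □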
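The bound of Lemma 4.2 can be compared with a Monte Carlo estimate of $E\max_{j \le p} |\varepsilon^T V_j|$ for standard Gaussian $\varepsilon$. The sketch below is only an illustration: the matrix `V`, whose columns play the role of $V_1, \ldots, V_p$, is drawn at random, and the function name `expected_max_abs`, the number of draws and the seeds are hypothetical choices.

```python
import numpy as np

def expected_max_abs(V, n_draws=20000, seed=0):
    """Monte Carlo estimate of E max_j |eps^T V_j| with eps ~ N(0, I_N)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_draws, V.shape[0]))
    # Each row of eps @ V holds eps^T V_j for j = 1, ..., p.
    return np.abs(eps @ V).max(axis=1).mean()

rng = np.random.default_rng(1)
V = rng.uniform(-1.0, 1.0, size=(50, 200))   # N = 50 rows, p = 200 columns, entries in [-1, 1]
p = V.shape[1]

estimate = expected_max_abs(V)
bound = 2.0 * np.sqrt(np.log(2 * p)) * np.linalg.norm(V, axis=0).max()
```

In such experiments the estimate sits below the bound with visible slack, which is consistent with the constant 2 in the lemma not being sharp.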